The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 112 million cases have been confirmed worldwide, with nearly 2.5 million associated deaths. Within the US alone, there have been over 500,000 deaths and upwards of 28 million cases reported. Governments around the world have implemented and recommended a number of policies to slow the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has affected the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.
The impact of COVID-19 on a given segment of the population appears to vary drastically with the socioeconomic characteristics of that segment. In particular, differing rates of infection and fatality have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population who die from the disease.
We assembled this dataset for our research with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.
There are two main goals for this case study.
Remark: please keep track of the most up-to-date version of this write-up.
The data comes from several different sources:
In this case study, we use the following three cleaned datasets:
Across all datasets, the unique county identifier is FIPS.
The cleaning procedure is attached in Appendix 2: Data cleaning. You may go through it if you are interested or would like to make any changes.
First read in the data.
The detailed description of the variables is in Appendix 1: Data description. Please get familiar with the variables. Summarize the two datasets briefly.
It is crucial to decide on the right granularity for visualization and analysis. We will compare daily vs. weekly total new cases by state, and we will see that daily reports are hard to interpret.
Plot new COVID cases in NY, WA and FL by state and by day. Are there any irregular patterns? What is the biggest problem with using single-day data?
Create weekly new cases per 100k, weekly_case_per100k. Draw spaghetti plots of weekly_case_per100k by state. Use TotalPopEst2019 as population.
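The weekly aggregation can be sketched as follows. This is a minimal illustration on toy data, not the case study's own code: the column names State, date and new_cases, the population figures, and the choice of Monday as the start of the week are all assumptions. Base R is used to keep the sketch free of package dependencies.

```r
# Toy daily data; column names (State, date, new_cases) are assumed for illustration.
daily <- data.frame(
  State = rep(c("NY", "WA"), each = 14),
  date = rep(seq(as.Date("2020-03-02"), by = "day", length.out = 14), times = 2),
  new_cases = c(rep(100, 14), rep(10, 14))
)
# Toy population table standing in for TotalPopEst2019 summed to state level.
pop <- data.frame(State = c("NY", "WA"), TotalPopEst2019 = c(2e6, 1e6))

# Label each date with the Monday that starts its week.
# format(, "%u") gives the weekday number: 1 = Monday, ..., 7 = Sunday.
daily$week <- daily$date - (as.integer(format(daily$date, "%u")) - 1)

# Sum daily new cases within each state and week.
weekly <- aggregate(new_cases ~ State + week, data = daily, FUN = sum)

# Attach population and scale to cases per 100,000 residents.
weekly <- merge(weekly, pop, by = "State")
weekly$weekly_case_per100k <- weekly$new_cases / weekly$TotalPopEst2019 * 1e5
weekly <- weekly[order(weekly$State, weekly$week), ]
print(weekly)
```

Summing over a week smooths out day-of-week reporting effects, and dividing by population makes states of different sizes comparable on one plot.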
Summarize the COVID case trend among states based on the plot in ii). What are possible reasons for the variability?
(Optional) Use covid_intervention to assess the effectiveness of lockdowns in flattening the curve.
limits argument in scale_fill_gradient() or use facet_wrap(); use lubridate::month() and lubridate::year() to extract month and year from date; use tidyr::complete(state, month, fill = list(new_case_per100k = NA)) to complete the missing months with no cases.)
plotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states? (Hints: Follow Appendix 3: Plotly heatmap in the Module 6 regularization lecture to plot the heatmap using plotly. Use the frame argument in add_trace() for animation. plotly only recognizes state abbreviations. Use unique(us_map(regions = "states") %>% select(abbr, full)) to get the abbreviations and merge them with the data.)
We now try to build a good parsimonious model to find possible factors related to the death rate at the county level. Let us not take time series into account for the moment and use the total number as of Feb 1, 2021.
Create the response variable total_death_per100k as the total number of COVID deaths per 100k by Feb 1, 2021. We suggest taking a log transformation, log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k into county_data for the following analysis.
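A minimal sketch of this step on toy data is shown below. The FIPS codes, the total_death column name and all values are made up for illustration, and it assumes that cumulative deaths and TotalPopEst2019 are available in the same county-level table.

```r
# Toy cumulative county counts as of Feb 1, 2021; values and the
# total_death column name are hypothetical.
deaths <- data.frame(
  FIPS = c(1001, 1003, 1005),
  total_death = c(0, 50, 200),
  TotalPopEst2019 = c(50000, 200000, 100000)
)

# Deaths per 100,000 residents.
deaths$total_death_per100k <- deaths$total_death / deaths$TotalPopEst2019 * 1e5

# Adding 1 before the log keeps counties with zero deaths finite: log(0 + 1) = 0.
deaths$log_total_death_per100k <- log(deaths$total_death_per100k + 1)

# Toy stand-in for county_data, keyed by FIPS.
county_data <- data.frame(FIPS = c(1001, 1003, 1005), PopDensity2010 = c(90, 110, 30))

# Left join on FIPS so every county in county_data is kept.
county_data <- merge(
  county_data,
  deaths[, c("FIPS", "total_death_per100k", "log_total_death_per100k")],
  by = "FIPS", all.x = TRUE
)
print(county_data)
```

The left join (all.x = TRUE) leaves NA in the response for any county without death data, which makes such counties easy to spot in the missing-value report.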
Select possible variables in county_data as covariates. We provide county_data_sub, a subset of variables from county_data, to get you started. Please add any other potential variables as you wish.
Report missing values in your final subset of variables.
In the following analysis, you may ignore the missing values.
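One way to report missing values is to count them per variable. The sketch below uses a toy stand-in for county_data_sub; the values are invented, and only the variable names come from this write-up.

```r
# Toy stand-in for the chosen covariate subset; values are made up.
county_data_sub <- data.frame(
  FIPS = c(1001, 1003, 1005, 1007),
  PerCapitaInc = c(27000, NA, 21000, 24000),
  PopDensity2010 = c(90, 110, NA, NA)
)

# Count and share of missing values per variable.
na_count <- colSums(is.na(county_data_sub))
na_report <- data.frame(
  variable = names(na_count),
  n_missing = as.integer(na_count),
  pct_missing = round(100 * na_count / nrow(county_data_sub), 1)
)
print(na_report[na_report$n_missing > 0, ], row.names = FALSE)

# Number of counties with complete data on every selected variable.
n_complete <- sum(complete.cases(county_data_sub))
print(n_complete)
```

Comparing n_complete with the total number of rows shows how many counties would be dropped by a model fit that silently omits incomplete cases.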
lambda.1se to choose a smaller model.
## Anova Table (Type II tests)
##
## Response: log_death_rate
##                         Sum Sq   Df F value   Pr(>F)
## State                      500   48   16.67  < 2e-16 ***
## PovertyAllAgesPct            0    1    0.25  0.61737
## PerCapitaInc                 1    1    0.90  0.34204
## PctEmpConstruction          28    1   45.44  1.9e-11 ***
## PctEmpMining                 7    1   11.05  0.00090 ***
## PctEmpAgriculture           79    1  126.72  < 2e-16 ***
## PctEmpManufacturing          1    1    0.93  0.33503
## PopDensity2010               5    1    7.44  0.00641 **
## Age65AndOlderPct2010        12    1   19.44  1.1e-05 ***
## Under18Pct2010              37    1   59.14  2.0e-14 ***
## Ed3SomeCollegePct            9    1   14.52  0.00014 ***
## Ed5CollegePlusPct            6    1    9.11  0.00256 **
## NetMigrationRate1019        13    1   20.53  6.1e-06 ***
## NaturalChangeRate1019       12    1   19.65  9.6e-06 ***
## WhiteNonHispanicPct2010      2    1    2.99  0.08371 .
## HispanicPct2010             10    1   16.39  5.3e-05 ***
## Type_2015_Update             3    1    5.08  0.02425 *
## Residuals                 1899 3040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
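A LASSO fit that forces State into the model and selects variables at lambda.1se might look like the sketch below. It runs on simulated data with made-up variable names and effect sizes, and it assumes the glmnet package is installed. Setting penalty.factor to 0 for the State dummy columns is one way to force them in. It is offered as an illustration, not as the procedure that produced the table above.

```r
library(glmnet)  # assumed to be installed

set.seed(1)
n <- 300
# Simulated data; variable names and effect sizes are made up for illustration.
dat <- data.frame(
  State = factor(sample(c("AL", "NY", "WA"), n, replace = TRUE)),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n)
)
dat$y <- 1 + 0.8 * (dat$State == "NY") + 1.5 * dat$x1 + rnorm(n)

# Design matrix without the intercept column; State expands to dummy columns.
X <- model.matrix(y ~ ., data = dat)[, -1]

# A penalty factor of 0 leaves a column unpenalized, so it stays in the model.
pf <- ifelse(grepl("^State", colnames(X)), 0, 1)

# Cross-validated LASSO (alpha = 1).
cv_fit <- cv.glmnet(X, dat$y, alpha = 1, penalty.factor = pf)

# lambda.1se is the largest lambda whose CV error is within one standard error
# of the minimum; it typically gives a sparser model than lambda.min.
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(b)[b != 0], "(Intercept)")
print(selected)
```

The selected variables can then be refit with an ordinary linear model, with State kept as a single factor, so that each term can be tested as a whole.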
Cp or BIC to fine tune the LASSO model from iii). Again force in State in the process.
## Anova Table (Type II tests)
##
## Response: log_death_rate
##                         Sum Sq   Df F value   Pr(>F)
## State                      500   48   16.67  < 2e-16 ***
## PovertyAllAgesPct            0    1    0.25  0.61737
## PerCapitaInc                 1    1    0.90  0.34204
## PctEmpConstruction          28    1   45.44  1.9e-11 ***
## PctEmpMining                 7    1   11.05  0.00090 ***
## PctEmpAgriculture           79    1  126.72  < 2e-16 ***
## PctEmpManufacturing          1    1    0.93  0.33503
## PopDensity2010               5    1    7.44  0.00641 **
## Age65AndOlderPct2010        12    1   19.44  1.1e-05 ***
## Under18Pct2010              37    1   59.14  2.0e-14 ***
## Ed3SomeCollegePct            9    1   14.52  0.00014 ***
## Ed5CollegePlusPct            6    1    9.11  0.00256 **
## NetMigrationRate1019        13    1   20.53  6.1e-06 ***
## NaturalChangeRate1019       12    1   19.65  9.6e-06 ***
## WhiteNonHispanicPct2010      2    1    2.99  0.08371 .
## HispanicPct2010             10    1   16.39  5.3e-05 ***
## Type_2015_Update             3    1    5.08  0.02425 *
## Residuals                 1899 3040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova Table (Type II tests)
##
## Response: log_death_rate
##                         Sum Sq   Df F value   Pr(>F)
## State                      508   48   16.93  < 2e-16 ***
## PctEmpConstruction          36    1   57.85  3.8e-14 ***
## PctEmpMining                11    1   17.74  2.6e-05 ***
## PctEmpAgriculture           91    1  145.97  < 2e-16 ***
## PopDensity2010               4    1    6.58    0.010 *
## Age65AndOlderPct2010        11    1   18.25  2.0e-05 ***
## Under18Pct2010              38    1   60.85  8.4e-15 ***
## Ed3SomeCollegePct           12    1   19.71  9.3e-06 ***
## Ed5CollegePlusPct           31    1   49.64  2.3e-12 ***
## NetMigrationRate1019        15    1   23.38  1.4e-06 ***
## NaturalChangeRate1019       12    1   19.44  1.1e-05 ***
## WhiteNonHispanicPct2010      4    1    5.73    0.017 *
## HispanicPct2010             10    1   16.61  4.7e-05 ***
## Type_2015_Update             3    1    4.45    0.035 *
## Residuals                 1900 3043
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: not plotting observations with leverage one:
## 54
It has been shown that COVID affects the elderly the most. It is also claimed that the COVID death rate among African Americans and Latinxs is higher. Does your analysis support these arguments?
Based on your final model, summarize your findings. In particular, summarize the state effect controlling for the other variables. Provide intervention recommendations to policy makers to reduce the COVID death rate.
## state coef
## 1: StateND 0.896199
## 2: StateSD 0.789720
## 3: StateMT 0.533683
## 4: StateDC 0.424230
## 5: StateLA 0.352791
## 6: StateMS 0.272436
## 7: StateIA 0.265483
## 8: StateAZ 0.257350
## 9: StateNJ 0.251688
## 10: StateWY 0.181962
## 11: StateIL 0.181870
## 12: StateTX 0.153056
## 13: StateCT 0.111730
## 14: StateAR 0.056081
## 15: StateGA 0.053575
## 16: StateTN 0.052008
## 17: StateMA 0.027635
## 18: StatePA 0.000915
## 19: StateFL -0.006695
## 20: StateSC -0.020896
## 21: StateDE -0.075836
## 22: StateIN -0.079801
## 23: StateMD -0.133003
## 24: StateMI -0.152742
## 25: StateWI -0.153404
## 26: StateMN -0.185672
## 27: StateRI -0.226424
## 28: StateID -0.314283
## 29: StateCO -0.368938
## 30: StateNC -0.376737
## 31: StateOK -0.379630
## 32: StateMO -0.384755
## 33: StateNE -0.407081
## 34: StateNY -0.409466
## 35: StateVA -0.511923
## 36: StateOH -0.543226
## 37: StateNV -0.613914
## 38: StateKS -0.632561
## 39: StateNM -0.671560
## 40: StateWV -0.706377
## 41: StateKY -0.756633
## 42: StateCA -0.820722
## 43: StateNH -0.849123
## 44: StateWA -0.870822
## 45: StateOR -0.873242
## 46: StateUT -1.075081
## 47: StateME -1.394051
## 48: StateVT -2.158665
## state coef
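To read the state coefficients on the original scale, they can be exponentiated. The sketch below assumes the response is the log-transformed rate defined earlier, log(total_death_per100k + 1). In that case a coefficient b multiplies (total_death_per100k + 1) by exp(b) relative to the reference state, holding the other covariates fixed. The table has no AL entry, which is consistent with R's default of taking the alphabetically first level as the reference, though this write-up does not state that explicitly.

```r
# Two coefficients copied from the table above.
coefs <- c(StateND = 0.896199, StateVT = -2.158665)

# Multiplicative effect on (total_death_per100k + 1) relative to the
# reference state, holding the other covariates fixed.
ratio <- exp(coefs)
print(round(ratio, 2))
```

Under these assumptions, a county in ND is predicted to have more than twice the value of (deaths per 100k + 1) of an otherwise identical county in the reference state, while a county in VT is predicted to have a small fraction of it.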